Heart Disease Prediction¶


Table of Contents¶

  • A. Introduction
  • B. Importing Libraries
  • C. Importing Data

    • 2. Data understanding
    • Direct look at the data
      • Shape
      • Columns
      • List of numerical features
      • List of categorical features
      • Descriptive statistics
      • Number/fractions of null values
        • Univariate visualization of categorical features
        • HeartDisease
        • BMI
        • SkinCancer
        • heart_Disease function
        • AgeCategory
        • Alcohol Drinking
        • Heart Disease Family History
        • Ordinal Encoding
        • One-Hot Encoding
        • Modelling
        • Model definition
        • Final Model Comparison
Introduction
  • Heart disease describes a range of conditions that affect the heart. The category includes blood vessel diseases such as coronary artery disease, heart rhythm problems (arrhythmias), and congenital heart defects, among others.

  • The term "heart disease" is often used interchangeably with "cardiovascular disease". Cardiovascular disease generally refers to conditions involving narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina), or stroke. Other heart conditions, such as those affecting the heart's muscle, valves, or rhythm, are also considered forms of heart disease.

  • Heart disease is one of the leading causes of morbidity and mortality worldwide. Predicting cardiovascular disease is considered one of the most important topics in clinical data analysis. The volume of data in the healthcare industry is enormous. Data mining turns large collections of raw healthcare data into information that can support informed decisions and predictions.

According to a news article, heart disease is the leading cause of death for both women and men. The article states the following:

  • About 610,000 people die of heart disease in the United States every year, which is 1 in every 4 deaths.

  • Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2009 were in men.

  • Coronary artery disease (CAD) is the most common type of heart disease, causing more than 370,000 deaths annually.

  • Every year, about 735,000 Americans have a heart attack. Of these, 525,000 are a first heart attack and 210,000 happen in people who have already had a heart attack.

Importing Libraries
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.model_selection import train_test_split
from pandas_profiling import ProfileReport
%matplotlib inline
import plotly.express as px
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc

pd.set_option('display.max_columns', None)
Importing Data

Data¶

 #   Feature                     Description
---  -------                     -----------
 1   HeartDisease                Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
 2   BMI                         Body Mass Index (BMI)
 3   Smoking                     Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
 4   AlcoholDrinking             Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
 5   Stroke                      (Ever told) (you had) a stroke?
 6   PhysicalHealth              Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30
 7   MentalHealth                Thinking about your mental health, for how many days during the past 30 days was your mental health not good?
 8   DiffWalking                 Do you have serious difficulty walking or climbing stairs?
 9   Sex                         Are you male or female?
 10  AgeCategory                 Fourteen-level age category
 11  Race                        Imputed race/ethnicity value
 12  Diabetic                    (Ever told) (you had) diabetes?
 13  PhysicalActivity            Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
 14  GenHealth                   Would you say that in general your health is...
 15  SleepTime                   On average, how many hours of sleep do you get in a 24-hour period?
 16  Asthma                      (Ever told) (you had) asthma?
 17  KidneyDisease               Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
 18  SkinCancer                  (Ever told) (you had) skin cancer?
 19  HeartDisease_FamilyHistory  Do you have family history of heart disease?
 20  State                       US state (residency)
In [2]:
df = pd.read_csv("C:/Users/victo/Documents/NUCLIOMESTRADO/heart_disease_project_data/heart_disease_data.csv")
In [3]:
df
Out[3]:
HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity GenHealth SleepTime Asthma KidneyDisease SkinCancer HeartDisease_FamilyHistory State
0 No 16.60 Yes No No 3.0 30.0 No Female 55-59 White Yes Yes Very good 5.0 Yes No Yes No MT
1 No 20.34 No NaN Yes 0.0 0.0 No Female 80 or older White No Yes Very good 7.0 No No No NaN VT
2 No 26.58 Yes NaN No 20.0 30.0 No Male 65-69 White Yes Yes Fair 8.0 Yes No No NaN WY
3 No 24.21 No NaN No 0.0 0.0 No Female 75-79 White No No Good 6.0 No No Yes No VT
4 No 23.71 No No No 28.0 0.0 Yes Female 40-44 White No Yes Very good 8.0 No No No NaN DC
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
319790 Yes 27.41 Yes No No 7.0 0.0 Yes Male 60-64 Hispanic Yes No Fair 6.0 Yes No No NaN AZ
319791 No 29.84 Yes NaN No 0.0 0.0 No Male 35-39 Hispanic No Yes Very good 5.0 Yes No No No NH
319792 No 24.24 No No No 0.0 0.0 No Female 45-49 Hispanic No Yes Good 6.0 No No No NaN DE
319793 No 32.81 No NaN No 0.0 0.0 No Female 25-29 Hispanic No No Good 12.0 No No No NaN UT
319794 No 46.56 No NaN No 0.0 0.0 No Female 80 or older Hispanic No Yes Good 8.0 No No No NaN OR

319795 rows × 20 columns

Data Understanding ¶

In [4]:
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   HeartDisease                319795 non-null  object 
 1   BMI                         319795 non-null  float64
 2   Smoking                     319795 non-null  object 
 3   AlcoholDrinking             212984 non-null  object 
 4   Stroke                      318683 non-null  object 
 5   PhysicalHealth              319795 non-null  float64
 6   MentalHealth                319795 non-null  float64
 7   DiffWalking                 319795 non-null  object 
 8   Sex                         319795 non-null  object 
 9   AgeCategory                 319795 non-null  object 
 10  Race                        319795 non-null  object 
 11  Diabetic                    319795 non-null  object 
 12  PhysicalActivity            319795 non-null  object 
 13  GenHealth                   319795 non-null  object 
 14  SleepTime                   319795 non-null  float64
 15  Asthma                      319795 non-null  object 
 16  KidneyDisease               319795 non-null  object 
 17  SkinCancer                  319446 non-null  object 
 18  HeartDisease_FamilyHistory  35263 non-null   object 
 19  State                       319795 non-null  object 
dtypes: float64(4), object(16)
memory usage: 48.8+ MB

Direct look at the data¶

Shape ¶

In [5]:
df.shape
Out[5]:
(319795, 20)

Columns ¶

In [6]:
df.columns
Out[6]:
Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
       'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',
       'Asthma', 'KidneyDisease', 'SkinCancer', 'HeartDisease_FamilyHistory',
       'State'],
      dtype='object')

List of numerical features ¶

In [7]:
numeric_features = df.select_dtypes(include=[np.number])
numeric_features.columns
Out[7]:
Index(['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime'], dtype='object')

List of categorical features ¶

In [8]:
categorical_features = df.select_dtypes(include=[object])
categorical_features.columns
Out[8]:
Index(['HeartDisease', 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking',
       'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity',
       'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer',
       'HeartDisease_FamilyHistory', 'State'],
      dtype='object')
In [9]:
df.duplicated().sum()
Out[9]:
254
In [10]:
#drop duplicates
df.drop_duplicates(inplace=True)
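
As a small illustration of the duplicate check and removal above, the sketch below uses a toy frame (the values are made up, not taken from the dataset). It also shows `keep=False`, which lists every copy of a repeated row, and an optional `reset_index(drop=True)`; the notebook itself keeps the original index labels, so the reset is only an assumption about what may be convenient later.

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values only)
toy = pd.DataFrame({
    "BMI": [16.6, 20.34, 16.6, 24.21],
    "Smoking": ["Yes", "No", "Yes", "No"],
})

# Rows that repeat an earlier row (the first occurrence is not counted)
n_duplicates = toy.duplicated().sum()

# Every row that has at least one identical twin, including the first one
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats and renumber the rows from 0
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(all_copies)
print(deduplicated)
```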

Descriptive statistics¶

In [11]:
df.describe(include=['object']).T
Out[11]:
count unique top freq
HeartDisease 319541 2 No 292168
Smoking 319541 2 No 187692
AlcoholDrinking 212788 2 No 191014
Stroke 318429 2 No 306360
DiffWalking 319541 2 No 275131
Sex 319541 2 Female 167696
AgeCategory 319541 14 65-69 34108
Race 319541 6 White 244962
Diabetic 319541 4 No 269403
PhysicalActivity 319541 2 Yes 247705
GenHealth 319541 5 Very good 113727
Asthma 319541 2 No 276671
KidneyDisease 319541 2 No 307762
SkinCancer 319192 2 No 289377
HeartDisease_FamilyHistory 35261 2 No 32006
State 319541 51 OH 6417

Number/fractions of null values¶

In [12]:
df.isnull().sum()
Out[12]:
HeartDisease                       0
BMI                                0
Smoking                            0
AlcoholDrinking               106753
Stroke                          1112
PhysicalHealth                     0
MentalHealth                       0
DiffWalking                        0
Sex                                0
AgeCategory                        0
Race                               0
Diabetic                           0
PhysicalActivity                   0
GenHealth                          0
SleepTime                          0
Asthma                             0
KidneyDisease                      0
SkinCancer                       349
HeartDisease_FamilyHistory    284280
State                              0
dtype: int64
In [13]:
df.isnull().sum()/len(df)*100
Out[13]:
HeartDisease                   0.000000
BMI                            0.000000
Smoking                        0.000000
AlcoholDrinking               33.408232
Stroke                         0.347999
PhysicalHealth                 0.000000
MentalHealth                   0.000000
DiffWalking                    0.000000
Sex                            0.000000
AgeCategory                    0.000000
Race                           0.000000
Diabetic                       0.000000
PhysicalActivity               0.000000
GenHealth                      0.000000
SleepTime                      0.000000
Asthma                         0.000000
KidneyDisease                  0.000000
SkinCancer                     0.109219
HeartDisease_FamilyHistory    88.965109
State                          0.000000
dtype: float64
In [14]:
import missingno as msno
msno.bar(df)
plt.show()
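
The two cells above report null counts and null percentages separately. A minimal sketch of putting both in one sorted table follows; the toy frame and the names `toy` and `null_summary` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Stroke": ["No", np.nan, "Yes", "No"],
    "AlcoholDrinking": [np.nan, np.nan, "No", "Yes"],
    "BMI": [16.6, 20.34, 26.58, 24.21],
})

# One table with the null count and the null percentage per column,
# sorted so the columns with the most missing data come first
null_summary = pd.DataFrame({
    "n_null": toy.isnull().sum(),
    "pct_null": toy.isnull().mean() * 100,
}).sort_values("pct_null", ascending=False)

print(null_summary)
```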

Univariate visualization of categorical features¶

In [15]:
def categorical_feature_func(df, categorical_features):
    # Iterating over a DataFrame yields its column names
    columns = list(categorical_features)
    n_cols = 4
    n_rows = math.ceil(len(columns) / n_cols)

    # Set the plot style once, before drawing
    sns.set(palette='Paired')
    sns.set_style("ticks")
    plt.figure(figsize=(25, 5 * n_rows))

    for i, feature in enumerate(columns, 1):
        plt.subplot(n_rows, n_cols, i)
        ax = sns.countplot(x=feature, data=df)
        # Rotate the category labels so long names stay readable
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right")

    plt.tight_layout()
    plt.show()

categorical_feature_func(df, categorical_features)
In [16]:
numeric_features = df.select_dtypes(include=[np.number])

plt.figure(figsize = (25,15))
for i, feature in enumerate(numeric_features.columns):
    plt.subplot(2,2,i + 1)
    sns.set(palette='dark')
    sns.set_style("ticks")
    sns.histplot(df[feature],kde=True)
    plt.xlabel(feature)
    plt.ylabel("Count")

HeartDisease ¶

In [17]:
# Encode the target as 1 (Yes) / 0 (No)
df['HeartDisease'] = df['HeartDisease'].map({'Yes': 1, 'No': 0})

Proportion of heart disease in the data¶

In [18]:
plt.figure(figsize = (10, 6))
# Use a white background
plt.style.use("seaborn-v0_8-whitegrid")

# Proportion of heart disease in the data
plt.pie(x=df['HeartDisease'].value_counts(),
        autopct='%1.3f%%',
        labels=df['HeartDisease'].value_counts().index,
        colors=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
        wedgeprops={'linewidth': 1, 'edgecolor': 'white'})

# Set the title and legend
plt.title('Proportion of heart disease')
plt.legend()

# Adjust the layout for a cleaner look
plt.tight_layout()

# Show the chart
plt.show()
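
The shares drawn in the pie chart can also be read as numbers. The sketch below assumes a toy 0/1 target (the values are illustrative, not the real data) and uses `value_counts(normalize=True)` to get the relative frequency of each class.

```python
import pandas as pd

# Toy target column (illustrative values, not the real data)
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="HeartDisease")

# Absolute counts and relative frequencies per class
counts = target.value_counts()
shares = target.value_counts(normalize=True)

print(counts)
print(shares)
```

A strongly unequal split like this one is worth keeping in mind when accuracy is later used as a metric, since always predicting the majority class would already score high.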

BMI ¶

In [19]:
pd.set_option('display.max_rows', None)
df['BMI'].describe()
Out[19]:
count    319541.000000
mean         28.328993
std           6.371116
min          12.020000
25%          24.030000
50%          27.340000
75%          31.450000
max         119.000000
Name: BMI, dtype: float64
In [20]:
# Create the scatter plot
fig = px.scatter(df, x="BMI", y="SleepTime")

# Show the plot
fig.show()

SkinCancer ¶

In [21]:
# Check for missing values in the 'SkinCancer' column
df['SkinCancer'].isna().sum()
Out[21]:
349
In [22]:
# Drop rows with missing values in the 'SkinCancer' column
df.dropna(subset=['SkinCancer'], inplace=True)

Stroke ¶

In [23]:
# Check for missing values in the 'Stroke' column
df['Stroke'].isna().sum()
Out[23]:
1111
In [24]:
# Drop rows with missing values in the 'Stroke' column
df.dropna(subset=['Stroke'], inplace=True)
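
The two `dropna` calls above can be expressed as a single call with both column names in `subset`. The following is a minimal sketch on a toy frame (made-up values) that also reports how many rows were lost; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls in two columns (illustrative values only)
toy = pd.DataFrame({
    "SkinCancer": ["No", np.nan, "Yes", "No", "No"],
    "Stroke": ["No", "No", np.nan, "Yes", "No"],
    "BMI": [16.6, 20.34, 26.58, 24.21, 23.71],
})

rows_before = len(toy)

# Drop rows that have a null in either of the two columns
cleaned = toy.dropna(subset=["SkinCancer", "Stroke"])

rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
```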

heart_Disease function ¶

In [25]:
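import plotly.express as px

def heart_Disease_Func(data, column, count=True):
    # Summarize one column; note that unique() counts NaN as a value
    unique_values = data[column].unique()
    null_count = data[column].isnull().sum()
    value_counts = data[column].value_counts()
    
    # Printed labels are in Portuguese: number of unique values,
    # the unique values, number of nulls, and count per value
    print(f'Quantidade de valores únicos: {len(unique_values)}')
    print(f'\nQuais são os valores únicos: {unique_values}')
    print(f'\nQuantidade de valores nulos: {null_count}')
    print(f'\nQuantidade por opção: \n{value_counts}')
    
    if count:
        # Grouped histogram split by the target
        fig = px.histogram(data, x=column, color='HeartDisease', barmode='group')
        fig.show()
    else:
        # px.histogram has no 'kde' marginal; the valid options are
        # 'rug', 'box', 'violin' and 'histogram'
        fig = px.histogram(data, x=column, marginal='box')
        fig.show()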
In [26]:
heart_Disease_Func(df, 'HeartDisease_FamilyHistory')
Quantidade de valores únicos: 3

Quais são os valores únicos: ['No' nan 'Yes']

Quantidade de valores nulos: 282992

Quantidade por opção: 
No     31856
Yes     3233
Name: HeartDisease_FamilyHistory, dtype: int64
In [27]:
heart_Disease_Func(df, 'SleepTime')
Quantidade de valores únicos: 24

Quais são os valores únicos: [ 5.  7.  8.  6. 12.  4.  9. 10. 15.  3.  2.  1. 16. 18. 14. 20. 11. 13.
 17. 24. 19. 21. 22. 23.]

Quantidade de valores nulos: 0

Quantidade por opção: 
7.0     97164
8.0     97066
6.0     66388
5.0     19098
9.0     15972
10.0     7762
4.0      7712
12.0     2196
3.0      1981
2.0       786
1.0       544
11.0      415
14.0      243
16.0      236
15.0      189
18.0      102
13.0       95
20.0       64
24.0       30
17.0       21
22.0        9
19.0        3
23.0        3
21.0        2
Name: SleepTime, dtype: int64
In [28]:
heart_Disease_Func(df, 'Race')
Quantidade de valores únicos: 6

Quais são os valores únicos: ['White' 'Black' 'Asian' 'American Indian/Alaskan Native' 'Other'
 'Hispanic']

Quantidade de valores nulos: 0

Quantidade por opção: 
White                             243825
Hispanic                           27317
Black                              22846
Other                              10887
Asian                               8029
American Indian/Alaskan Native      5177
Name: Race, dtype: int64
In [29]:
heart_Disease_Func(df, 'AgeCategory')
Quantidade de valores únicos: 14

Quais são os valores únicos: ['55-59' '80 or older' '65-69' '75-79' '40-44' '70-74' '60-64' '50-54'
 '45-49' '18-24' '35-39' '30-34' '25-29' '0']

Quantidade de valores nulos: 0

Quantidade por opção: 
65-69          33965
60-64          33485
70-74          30904
55-59          29589
50-54          25210
80 or older    24063
45-49          21662
75-79          21383
18-24          20959
40-44          20900
35-39          20429
30-34          18627
25-29          16847
0                 58
Name: AgeCategory, dtype: int64
In [30]:
heart_Disease_Func(df, 'MentalHealth')
Quantidade de valores únicos: 31

Quais são os valores únicos: [30.  0.  2.  5. 15.  8.  4.  3. 10. 14. 20.  1.  7. 24.  9. 28. 16. 12.
 25. 17. 18. 21. 29.  6. 22. 13. 23. 27. 26. 11. 19.]

Quantidade de valores nulos: 0

Quantidade por opção: 
0.0     204227
30.0     17297
2.0      16417
5.0      14088
10.0     10464
3.0      10411
15.0      9845
1.0       9242
7.0       5505
20.0      5397
4.0       5355
14.0      2042
25.0      1945
6.0       1504
8.0       1091
12.0       755
28.0       509
21.0       350
29.0       316
18.0       211
9.0        203
16.0       151
17.0       127
27.0       125
13.0       110
22.0        98
11.0        83
23.0        67
24.0        66
26.0        59
19.0        21
Name: MentalHealth, dtype: int64
In [31]:
heart_Disease_Func(df, 'PhysicalHealth')
Quantidade de valores únicos: 31

Quais são os valores únicos: [ 3.  0. 20. 28.  6. 15.  5. 30.  7.  1.  2. 21.  4. 10. 14. 18.  8. 25.
 16. 29. 27. 17. 24. 12. 23. 26. 22. 19.  9. 13. 11.]

Quantidade de valores nulos: 0

Quantidade por opção: 
0.0     225304
30.0     19424
2.0      14808
1.0      10436
3.0       8586
5.0       7575
10.0      5425
15.0      4990
7.0       4605
4.0       4443
20.0      3197
14.0      2878
6.0       1265
25.0      1156
8.0        919
21.0       626
12.0       604
28.0       444
29.0       203
9.0        180
18.0       167
16.0       135
27.0       124
17.0       109
13.0        91
22.0        89
11.0        84
24.0        67
26.0        66
23.0        46
19.0        35
Name: PhysicalHealth, dtype: int64

AgeCategory ¶

In [32]:
df.AgeCategory.value_counts()
Out[32]:
65-69          33965
60-64          33485
70-74          30904
55-59          29589
50-54          25210
80 or older    24063
45-49          21662
75-79          21383
18-24          20959
40-44          20900
35-39          20429
30-34          18627
25-29          16847
0                 58
Name: AgeCategory, dtype: int64
In [33]:
df['AgeCategory'].unique()
Out[33]:
array(['55-59', '80 or older', '65-69', '75-79', '40-44', '70-74',
       '60-64', '50-54', '45-49', '18-24', '35-39', '30-34', '25-29', '0'],
      dtype=object)
In [34]:
df.drop(df[df['AgeCategory'] == '0'].index, inplace=True)
In [35]:
# Categorizing Age Groups in DataFrame
df['AgeCategory']=df['AgeCategory'].replace(['18-24','25-29','30-34'],'Jovem')
df['AgeCategory']=df['AgeCategory'].replace(['35-39','40-44','45-49','50-54'],'Adulto')
df['AgeCategory']=df['AgeCategory'].replace(['55-59','60-64','65-69','70-74'],'Idoso')
df['AgeCategory']=df['AgeCategory'].replace(['75-79','80 or older'],'Velho')
In [36]:
# Define the specific order of the categories
order = ['Jovem', 'Adulto', 'Idoso', 'Velho']

# Build a dictionary mapping each category to its ordinal value
mapping = {category: i for i, category in enumerate(order)}

# Apply the ordinal encoding to the 'AgeCategory' column
df['AgeCategory'] = df['AgeCategory'].map(mapping)
In [37]:
df['AgeCategory'].unique()
Out[37]:
array([2, 3, 1, 0], dtype=int64)
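
The grouping and ordinal encoding of `AgeCategory` can be sketched with two dictionaries and `Series.map`. This is an illustrative alternative under stated assumptions, not the notebook's own code: the names `age_to_group`, `group_to_code` and the toy Series are hypothetical, while the group labels and codes follow the cells above. One property of `map` is that any value missing from the dictionary becomes NaN, which makes stray categories such as `'0'` easy to spot.

```python
import pandas as pd

# Hypothetical lookup tables; labels and codes follow the notebook
age_to_group = {
    "18-24": "Jovem", "25-29": "Jovem", "30-34": "Jovem",
    "35-39": "Adulto", "40-44": "Adulto", "45-49": "Adulto", "50-54": "Adulto",
    "55-59": "Idoso", "60-64": "Idoso", "65-69": "Idoso", "70-74": "Idoso",
    "75-79": "Velho", "80 or older": "Velho",
}
group_to_code = {"Jovem": 0, "Adulto": 1, "Idoso": 2, "Velho": 3}

# Toy column that includes the invalid '0' category
ages = pd.Series(["55-59", "80 or older", "40-44", "18-24", "0"])

# First map the raw bins to groups, then the groups to ordinal codes
codes = ages.map(age_to_group).map(group_to_code)

# Values absent from the dictionary end up as NaN and can be listed
unmapped = ages[codes.isna()]

print(codes.tolist())
print(unmapped.tolist())
```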
In [38]:
import plotly.express as px

# Group by 'AgeCategory' and 'HeartDisease' and count occurrences
data = df.groupby(['AgeCategory', 'HeartDisease']).size().unstack()

# Set the colors
cores = ['#71AEC2', '#D58989']

# Create the interactive bar chart
fig = px.bar(data_frame=data, x=data.index, y=data.columns, color_discrete_sequence=cores)

# Set the title and axis labels
fig.update_layout(title='Frequency of heart disease by age category',
                  xaxis_title='Age categories',
                  yaxis_title='Frequency')

# Show the interactive chart
fig.show()
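
Raw counts per age group depend on how many respondents each group has. A complementary view, sketched below on made-up values, is the share of positive cases per group: with a 0/1 target the group mean is exactly that share. The frame `toy` and the name `rate_by_age` are assumptions for illustration.

```python
import pandas as pd

# Toy data with the ordinal age codes used above (illustrative values only)
toy = pd.DataFrame({
    "AgeCategory": [0, 0, 1, 1, 2, 2, 3, 3],
    "HeartDisease": [0, 0, 0, 0, 0, 1, 1, 1],
})

# With a 0/1 target, the mean per group is the share of positive cases
rate_by_age = toy.groupby("AgeCategory")["HeartDisease"].mean()

print(rate_by_age)
```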

Alcohol Drinking ¶

In [39]:
# Check for null values in the AlcoholDrinking column
df['AlcoholDrinking'].value_counts(dropna=False)
Out[39]:
No     190078
NaN    106272
Yes     21673
Name: AlcoholDrinking, dtype: int64
In [40]:
# Check the unique values in the AlcoholDrinking column
df['AlcoholDrinking'].unique()
Out[40]:
array(['No', nan, 'Yes'], dtype=object)
In [41]:
# Fill the null values of the AlcoholDrinking column with a placeholder category
df['AlcoholDrinking'] = df['AlcoholDrinking'].fillna('ZZZ')
In [42]:
df['AlcoholDrinking'].unique()
Out[42]:
array(['No', 'ZZZ', 'Yes'], dtype=object)
In [43]:
df['AlcoholDrinking'].value_counts(dropna=False)
Out[43]:
No     190078
ZZZ    106272
Yes     21673
Name: AlcoholDrinking, dtype: int64

Heart Disease Family History ¶

In [44]:
# Check for null values in the HeartDisease_FamilyHistory column
df['HeartDisease_FamilyHistory'].value_counts(dropna=False)
Out[44]:
NaN    282938
No      31853
Yes      3232
Name: HeartDisease_FamilyHistory, dtype: int64
In [45]:
# Fill the null values of the HeartDisease_FamilyHistory column with a placeholder category
df['HeartDisease_FamilyHistory'] = df['HeartDisease_FamilyHistory'].fillna('XXX')
In [46]:
df['HeartDisease_FamilyHistory'].value_counts(dropna=False)
Out[46]:
XXX    282938
No      31853
Yes      3232
Name: HeartDisease_FamilyHistory, dtype: int64
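
Both `AlcoholDrinking` and `HeartDisease_FamilyHistory` keep their missing answers as a separate placeholder category instead of dropping rows. A minimal sketch of that idea as a reusable function is shown below; `fill_with_sentinel` is a hypothetical helper, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def fill_with_sentinel(frame, columns, sentinel):
    """Return a copy where nulls in `columns` become the `sentinel` label."""
    result = frame.copy()
    for column in columns:
        result[column] = result[column].fillna(sentinel)
    return result

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "HeartDisease_FamilyHistory": ["No", np.nan, "Yes", np.nan, "No"],
    "BMI": [16.6, 20.34, 26.58, 24.21, 23.71],
})

filled = fill_with_sentinel(toy, ["HeartDisease_FamilyHistory"], "XXX")

print(filled["HeartDisease_FamilyHistory"].value_counts())
```

Because the function works on a copy, the original frame keeps its nulls, which can help when comparing several ways of handling missing data.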
In [47]:
# Transposed view of the first rows of df
df.head().T
Out[47]:
0 1 2 3 4
HeartDisease 0 0 0 0 0
BMI 16.6 20.34 26.58 24.21 23.71
Smoking Yes No Yes No No
AlcoholDrinking No ZZZ ZZZ ZZZ No
Stroke No Yes No No No
PhysicalHealth 3.0 0.0 20.0 0.0 28.0
MentalHealth 30.0 0.0 30.0 0.0 0.0
DiffWalking No No No No Yes
Sex Female Female Male Female Female
AgeCategory 2 3 2 3 1
Race White White White White White
Diabetic Yes No Yes No No
PhysicalActivity Yes Yes Yes No Yes
GenHealth Very good Very good Fair Good Very good
SleepTime 5.0 7.0 8.0 6.0 8.0
Asthma Yes No Yes No No
KidneyDisease No No No No No
SkinCancer Yes No No Yes No
HeartDisease_FamilyHistory No XXX XXX No XXX
State MT VT WY VT DC

Ordinal Encoding ¶

GenHealth¶

In [48]:
df['GenHealth'].unique()
Out[48]:
array(['Very good', 'Fair', 'Good', 'Poor', 'Excellent'], dtype=object)
In [49]:
df['GenHealth'].value_counts(dropna=False)
Out[49]:
Very good    113183
Good          92693
Excellent     66395
Fair          34506
Poor          11246
Name: GenHealth, dtype: int64
In [50]:
# Define the specific order of the categories
order = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']

# Build a dictionary mapping each category to its ordinal value
mapping = {category: i for i, category in enumerate(order)}

# Apply the ordinal encoding to the 'GenHealth' column
df['GenHealth'] = df['GenHealth'].map(mapping)
In [51]:
df['GenHealth'].unique()
Out[51]:
array([3, 1, 2, 0, 4], dtype=int64)
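
The same ordinal codes can be obtained with an ordered pandas categorical, which stores the ranking alongside the data. The sketch below is an alternative to the dictionary mapping above, not the notebook's method; the toy Series is an assumption. A label outside `order` would receive the code -1.

```python
import pandas as pd

order = ["Poor", "Fair", "Good", "Very good", "Excellent"]

# Toy column (illustrative values only)
health = pd.Series(["Very good", "Fair", "Good", "Poor", "Excellent"])

# An ordered categorical keeps the ranking; .codes exposes it as integers
categorical = pd.Categorical(health, categories=order, ordered=True)
codes = pd.Series(categorical.codes, index=health.index)

print(codes.tolist())
```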
In [52]:
df.head(1).T
Out[52]:
0
HeartDisease 0
BMI 16.6
Smoking Yes
AlcoholDrinking No
Stroke No
PhysicalHealth 3.0
MentalHealth 30.0
DiffWalking No
Sex Female
AgeCategory 2
Race White
Diabetic Yes
PhysicalActivity Yes
GenHealth 3
SleepTime 5.0
Asthma Yes
KidneyDisease No
SkinCancer Yes
HeartDisease_FamilyHistory No
State MT
In [53]:
df.apply(lambda x: x.nunique(), axis=0)
Out[53]:
HeartDisease                     2
BMI                           3606
Smoking                          2
AlcoholDrinking                  3
Stroke                           2
PhysicalHealth                  31
MentalHealth                    31
DiffWalking                      2
Sex                              2
AgeCategory                      4
Race                             6
Diabetic                         4
PhysicalActivity                 2
GenHealth                        5
SleepTime                       24
Asthma                           2
KidneyDisease                    2
SkinCancer                       2
HeartDisease_FamilyHistory       3
State                           51
dtype: int64

Sleep Time¶

In [54]:
df['SleepTime'].value_counts(dropna=False)
Out[54]:
7.0     97138
8.0     97054
6.0     66379
5.0     19093
9.0     15970
10.0     7761
4.0      7710
12.0     2196
3.0      1980
2.0       786
1.0       544
11.0      415
14.0      243
16.0      236
15.0      189
18.0      102
13.0       95
20.0       64
24.0       30
17.0       21
22.0        9
19.0        3
23.0        3
21.0        2
Name: SleepTime, dtype: int64
In [55]:
# Define the category order (unique values in ascending order)
order = sorted(df['SleepTime'].unique())

# Build a dictionary mapping each value to its ordinal position (1 to 24)
mapping = {value: i+1 for i, value in enumerate(order)}

# Apply the ordinal encoding to the 'SleepTime' column of 'df'.
# The unique values are already the whole numbers 1 to 24, so this mapping
# is the identity and the column is left unchanged.
df['SleepTime'] = df['SleepTime'].replace(mapping)
In [56]:
df['SleepTime'].value_counts(dropna=False)
Out[56]:
7.0     97138
8.0     97054
6.0     66379
5.0     19093
9.0     15970
10.0     7761
4.0      7710
12.0     2196
3.0      1980
2.0       786
1.0       544
11.0      415
14.0      243
16.0      236
15.0      189
18.0      102
13.0       95
20.0       64
24.0       30
17.0       21
22.0        9
19.0        3
23.0        3
21.0        2
Name: SleepTime, dtype: int64
In [57]:
df['SleepTime'].unique()
Out[57]:
array([ 5.,  7.,  8.,  6., 12.,  4.,  9., 10., 15.,  3.,  2.,  1., 16.,
       18., 14., 20., 11., 13., 17., 24., 19., 21., 22., 23.])
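
The outputs before and after the `SleepTime` encoding are the same. The sketch below reproduces the position-in-sorted-order mapping on two toy Series to show why: when the distinct values are already 1, 2, 3, ... without gaps, each value maps to itself, and the values only change when there are gaps. The helper name `ordinal_from_sorted` is hypothetical, and `map` is used here in place of `replace`.

```python
import pandas as pd

def ordinal_from_sorted(values):
    """Map each distinct value to its 1-based position in sorted order."""
    order = sorted(values.unique())
    mapping = {value: i + 1 for i, value in enumerate(order)}
    return values.map(mapping)

# Distinct values 1..4 with no gaps: the encoding reproduces the input
contiguous = pd.Series([3.0, 1.0, 2.0, 4.0])

# Distinct values 5, 7, 12: the gaps are closed and the values change
gappy = pd.Series([5.0, 7.0, 12.0, 7.0])

encoded_contiguous = ordinal_from_sorted(contiguous)
encoded_gappy = ordinal_from_sorted(gappy)

print(encoded_contiguous.tolist())
print(encoded_gappy.tolist())
```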

PhysicalHealth¶

In [58]:
df['PhysicalHealth'].value_counts(dropna=False)
Out[58]:
0.0     225263
30.0     19422
2.0      14805
1.0      10432
3.0       8585
5.0       7575
10.0      5423
15.0      4990
7.0       4605
4.0       4443
20.0      3197
14.0      2877
6.0       1265
25.0      1155
8.0        919
21.0       625
12.0       603
28.0       444
29.0       203
9.0        180
18.0       166
16.0       135
27.0       124
17.0       109
13.0        91
22.0        89
11.0        84
24.0        67
26.0        66
23.0        46
19.0        35
Name: PhysicalHealth, dtype: int64
In [59]:
# Convert the "PhysicalHealth" column to int
df['PhysicalHealth'] = df['PhysicalHealth'].astype(int)

MentalHealth¶

In [60]:
df['MentalHealth'].value_counts(dropna=False)
Out[60]:
0.0     204191
30.0     17294
2.0      16413
5.0      14084
10.0     10462
3.0      10409
15.0      9842
1.0       9242
7.0       5505
20.0      5396
4.0       5353
14.0      2041
25.0      1945
6.0       1504
8.0       1091
12.0       755
28.0       509
21.0       350
29.0       316
18.0       211
9.0        203
16.0       151
17.0       127
27.0       125
13.0       110
22.0        98
11.0        83
23.0        67
24.0        66
26.0        59
19.0        21
Name: MentalHealth, dtype: int64
In [61]:
# Convert the "MentalHealth" column to int
df['MentalHealth'] = df['MentalHealth'].astype(int)
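
Casting a float column with `astype(int)` truncates any fractional part and fails on NaN. A minimal sketch of a guard before the cast follows; the toy frame and the names `day_columns` and `all_whole` are assumptions for illustration.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "PhysicalHealth": [3.0, 0.0, 20.0],
    "MentalHealth": [30.0, 0.0, 30.0],
})

day_columns = ["PhysicalHealth", "MentalHealth"]

# Only cast when every value is a whole number, so nothing is truncated
# (a NaN would make this check False as well)
all_whole = bool((toy[day_columns] % 1 == 0).all().all())

if all_whole:
    toy = toy.astype({column: int for column in day_columns})

print(toy.dtypes)
```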
In [62]:
df.head().T
Out[62]:
0 1 2 3 4
HeartDisease 0 0 0 0 0
BMI 16.6 20.34 26.58 24.21 23.71
Smoking Yes No Yes No No
AlcoholDrinking No ZZZ ZZZ ZZZ No
Stroke No Yes No No No
PhysicalHealth 3 0 20 0 28
MentalHealth 30 0 30 0 0
DiffWalking No No No No Yes
Sex Female Female Male Female Female
AgeCategory 2 3 2 3 1
Race White White White White White
Diabetic Yes No Yes No No
PhysicalActivity Yes Yes Yes No Yes
GenHealth 3 3 1 2 3
SleepTime 5.0 7.0 8.0 6.0 8.0
Asthma Yes No Yes No No
KidneyDisease No No No No No
SkinCancer Yes No No Yes No
HeartDisease_FamilyHistory No XXX XXX No XXX
State MT VT WY VT DC
In [63]:
df.apply(lambda x: x.unique(), axis=0)
Out[63]:
HeartDisease                                                             [0, 1]
BMI                           [16.6, 20.34, 26.58, 24.21, 23.71, 28.87, 21.6...
Smoking                                                               [Yes, No]
AlcoholDrinking                                                  [No, ZZZ, Yes]
Stroke                                                                [No, Yes]
PhysicalHealth                [3, 0, 20, 28, 6, 15, 5, 30, 7, 1, 2, 21, 4, 1...
MentalHealth                  [30, 0, 2, 5, 15, 8, 4, 3, 10, 14, 20, 1, 7, 2...
DiffWalking                                                           [No, Yes]
Sex                                                              [Female, Male]
AgeCategory                                                        [2, 3, 1, 0]
Race                          [White, Black, Asian, American Indian/Alaskan ...
Diabetic                      [Yes, No, No, borderline diabetes, Yes (during...
PhysicalActivity                                                      [Yes, No]
GenHealth                                                       [3, 1, 2, 0, 4]
SleepTime                     [5.0, 7.0, 8.0, 6.0, 12.0, 4.0, 9.0, 10.0, 15....
Asthma                                                                [Yes, No]
KidneyDisease                                                         [No, Yes]
SkinCancer                                                            [Yes, No]
HeartDisease_FamilyHistory                                       [No, XXX, Yes]
State                         [MT, VT, WY, DC, PA, AK, KY, DE, CA, NM, WI, V...
dtype: object

One-Hot Encoding ¶

Get Dummies¶

In [64]:
# Helper that replaces a column with its one-hot encoded dummy columns

def OHE(dataframe, column_name):
    dummy_dataset = pd.get_dummies(dataframe[column_name], prefix=column_name)
    dataframe = pd.concat([dataframe, dummy_dataset], axis=1)
    dataframe.drop(column_name, axis=1, inplace=True)
    del dummy_dataset

    return dataframe
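
The `OHE` helper above encodes one column per call. As a sketch of an alternative, `pd.get_dummies` also accepts the whole frame together with a `columns` list and encodes several columns in one step, replacing the originals. The toy frame is made up, and `dtype=int` is chosen here so the dummies are 0/1 integers.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "BMI": [16.6, 20.34, 26.58],
    "Smoking": ["Yes", "No", "Yes"],
    "Sex": ["Female", "Female", "Male"],
})

# Encode several columns at once; the originals are replaced by dummies
encoded = pd.get_dummies(toy, columns=["Smoking", "Sex"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

For two-level columns, passing `drop_first=True` would keep a single dummy per feature, since the second one carries no extra information.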

Smoking¶

In [65]:
df = OHE(df, 'Smoking')

AlcoholDrinking¶

In [66]:
df['AlcoholDrinking'].value_counts()
Out[66]:
No     190078
ZZZ    106272
Yes     21673
Name: AlcoholDrinking, dtype: int64
In [67]:
df = OHE(df, 'AlcoholDrinking')

Stroke¶

In [68]:
df['Stroke'].value_counts()
Out[68]:
No     305966
Yes     12057
Name: Stroke, dtype: int64
In [69]:
df = OHE(df, 'Stroke')

DiffWalking¶

In [70]:
df['DiffWalking'].value_counts()
Out[70]:
No     273791
Yes     44232
Name: DiffWalking, dtype: int64
In [71]:
df = OHE(df, 'DiffWalking')

Sex¶

In [72]:
df['Sex'].value_counts()
Out[72]:
Female    166896
Male      151127
Name: Sex, dtype: int64
In [73]:
df = OHE(df, 'Sex')

Race¶

In [74]:
df['Race'].value_counts(dropna=False)
Out[74]:
White                             243783
Hispanic                           27314
Black                              22840
Other                              10884
Asian                               8026
American Indian/Alaskan Native      5176
Name: Race, dtype: int64
In [75]:
df = OHE(df, 'Race')

Diabetic¶

In [76]:
df = OHE(df, 'Diabetic')

PhysicalActivity¶

In [77]:
df = OHE(df, 'PhysicalActivity')

Asthma¶

In [78]:
df = OHE(df, 'Asthma')

KidneyDisease¶

In [79]:
df = OHE(df, 'KidneyDisease')

SkinCancer¶

In [80]:
df = OHE(df, 'SkinCancer')

HeartDisease_FamilyHistory¶

In [81]:
df = OHE(df, 'HeartDisease_FamilyHistory')
In [82]:
df.apply(lambda x: x.nunique(), axis=0)
Out[82]:
HeartDisease                              2
BMI                                    3606
PhysicalHealth                           31
MentalHealth                             31
AgeCategory                               4
GenHealth                                 5
SleepTime                                24
State                                    51
Smoking_No                                2
Smoking_Yes                               2
AlcoholDrinking_No                        2
AlcoholDrinking_Yes                       2
AlcoholDrinking_ZZZ                       2
Stroke_No                                 2
Stroke_Yes                                2
DiffWalking_No                            2
DiffWalking_Yes                           2
Sex_Female                                2
Sex_Male                                  2
Race_American Indian/Alaskan Native       2
Race_Asian                                2
Race_Black                                2
Race_Hispanic                             2
Race_Other                                2
Race_White                                2
Diabetic_No                               2
Diabetic_No, borderline diabetes          2
Diabetic_Yes                              2
Diabetic_Yes (during pregnancy)           2
PhysicalActivity_No                       2
PhysicalActivity_Yes                      2
Asthma_No                                 2
Asthma_Yes                                2
KidneyDisease_No                          2
KidneyDisease_Yes                         2
SkinCancer_No                             2
SkinCancer_Yes                            2
HeartDisease_FamilyHistory_No             2
HeartDisease_FamilyHistory_XXX            2
HeartDisease_FamilyHistory_Yes            2
dtype: int64
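The counts above show that every binary feature ends up as a complementary pair (for example Smoking_No and Smoking_Yes), so one column of each pair carries no extra information. The following is a minimal sketch, assuming the notebook's OHE helper behaves like pandas.get_dummies (its definition is not shown in this section); the toy frame and its values are hypothetical. It illustrates how drop_first=True keeps k-1 indicator columns per feature.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Smoking": ["Yes", "No", "Yes", "No"],
    "Diabetic": ["Yes", "No", "No, borderline diabetes", "No"],
})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["Smoking", "Diabetic"])

# drop_first=True drops the first (alphabetically smallest) category of each
# feature, which removes the redundant complementary column.
reduced = pd.get_dummies(toy, columns=["Smoking", "Diabetic"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Tree-based models are largely indifferent to the redundant columns, while for linear models such as logistic regression dropping one level avoids perfectly collinear inputs.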

Drop the State column¶

In [83]:
# Remove the "State" column from the DataFrame
df.drop('State', axis=1, inplace=True)

Unique values per column¶

In [84]:
# Print the unique values of every column in the DataFrame
for column in df.columns:
    print(column, ':', df[column].unique())
HeartDisease : [0 1]
BMI : [16.6  20.34 26.58 ... 62.42 51.46 46.56]
PhysicalHealth : [ 3  0 20 28  6 15  5 30  7  1  2 21  4 10 14 18  8 25 16 29 27 17 24 12
 23 26 22 19  9 13 11]
MentalHealth : [30  0  2  5 15  8  4  3 10 14 20  1  7 24  9 28 16 12 25 17 18 21 29  6
 22 13 23 27 26 11 19]
AgeCategory : [2 3 1 0]
GenHealth : [3 1 2 0 4]
SleepTime : [ 5.  7.  8.  6. 12.  4.  9. 10. 15.  3.  2.  1. 16. 18. 14. 20. 11. 13.
 17. 24. 19. 21. 22. 23.]
Smoking_No : [0 1]
Smoking_Yes : [1 0]
AlcoholDrinking_No : [1 0]
AlcoholDrinking_Yes : [0 1]
AlcoholDrinking_ZZZ : [0 1]
Stroke_No : [1 0]
Stroke_Yes : [0 1]
DiffWalking_No : [1 0]
DiffWalking_Yes : [0 1]
Sex_Female : [1 0]
Sex_Male : [0 1]
Race_American Indian/Alaskan Native : [0 1]
Race_Asian : [0 1]
Race_Black : [0 1]
Race_Hispanic : [0 1]
Race_Other : [0 1]
Race_White : [1 0]
Diabetic_No : [0 1]
Diabetic_No, borderline diabetes : [0 1]
Diabetic_Yes : [1 0]
Diabetic_Yes (during pregnancy) : [0 1]
PhysicalActivity_No : [0 1]
PhysicalActivity_Yes : [1 0]
Asthma_No : [0 1]
Asthma_Yes : [1 0]
KidneyDisease_No : [1 0]
KidneyDisease_Yes : [0 1]
SkinCancer_No : [0 1]
SkinCancer_Yes : [1 0]
HeartDisease_FamilyHistory_No : [1 0]
HeartDisease_FamilyHistory_XXX : [0 1]
HeartDisease_FamilyHistory_Yes : [0 1]

Modelling ¶

In [85]:
from sklearn import model_selection # model assessment and model selection strategies
from sklearn import metrics # model evaluation metrics
In [86]:
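# Development set (later split into train + test)
dev_df_X = df.drop('HeartDisease', axis=1)
dev_df_y = df[['HeartDisease']]
In [87]:
# "Validation" frames. Note: these contain exactly the same rows as the
# development set above, so they do not form an independent hold-out set.
val_df_X = df.drop('HeartDisease', axis=1)
val_df_y = df[['HeartDisease']]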
In [88]:
dev_df_X.head().T
Out[88]:
0 1 2 3 4
BMI 16.6 20.34 26.58 24.21 23.71
PhysicalHealth 3.0 0.00 20.00 0.00 28.00
MentalHealth 30.0 0.00 30.00 0.00 0.00
AgeCategory 2.0 3.00 2.00 3.00 1.00
GenHealth 3.0 3.00 1.00 2.00 3.00
SleepTime 5.0 7.00 8.00 6.00 8.00
Smoking_No 0.0 1.00 0.00 1.00 1.00
Smoking_Yes 1.0 0.00 1.00 0.00 0.00
AlcoholDrinking_No 1.0 0.00 0.00 0.00 1.00
AlcoholDrinking_Yes 0.0 0.00 0.00 0.00 0.00
AlcoholDrinking_ZZZ 0.0 1.00 1.00 1.00 0.00
Stroke_No 1.0 0.00 1.00 1.00 1.00
Stroke_Yes 0.0 1.00 0.00 0.00 0.00
DiffWalking_No 1.0 1.00 1.00 1.00 0.00
DiffWalking_Yes 0.0 0.00 0.00 0.00 1.00
Sex_Female 1.0 1.00 0.00 1.00 1.00
Sex_Male 0.0 0.00 1.00 0.00 0.00
Race_American Indian/Alaskan Native 0.0 0.00 0.00 0.00 0.00
Race_Asian 0.0 0.00 0.00 0.00 0.00
Race_Black 0.0 0.00 0.00 0.00 0.00
Race_Hispanic 0.0 0.00 0.00 0.00 0.00
Race_Other 0.0 0.00 0.00 0.00 0.00
Race_White 1.0 1.00 1.00 1.00 1.00
Diabetic_No 0.0 1.00 0.00 1.00 1.00
Diabetic_No, borderline diabetes 0.0 0.00 0.00 0.00 0.00
Diabetic_Yes 1.0 0.00 1.00 0.00 0.00
Diabetic_Yes (during pregnancy) 0.0 0.00 0.00 0.00 0.00
PhysicalActivity_No 0.0 0.00 0.00 1.00 0.00
PhysicalActivity_Yes 1.0 1.00 1.00 0.00 1.00
Asthma_No 0.0 1.00 0.00 1.00 1.00
Asthma_Yes 1.0 0.00 1.00 0.00 0.00
KidneyDisease_No 1.0 1.00 1.00 1.00 1.00
KidneyDisease_Yes 0.0 0.00 0.00 0.00 0.00
SkinCancer_No 0.0 1.00 1.00 0.00 1.00
SkinCancer_Yes 1.0 0.00 0.00 1.00 0.00
HeartDisease_FamilyHistory_No 1.0 0.00 0.00 1.00 0.00
HeartDisease_FamilyHistory_XXX 0.0 1.00 1.00 0.00 1.00
HeartDisease_FamilyHistory_Yes 0.0 0.00 0.00 0.00 0.00
In [89]:
dev_df_y.head().T
Out[89]:
0 1 2 3 4
HeartDisease 0 0 0 0 0

Determine validation strategy (Random Holdout) & partition policy for test set (random)¶

In [90]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
                                        dev_df_X, # X 
                                        dev_df_y, # y
                                        test_size = 0.30, 
                                        random_state = 42
                                     )
In [91]:
X_train.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 222616 entries, 171781 to 122586
Columns: 38 entries, BMI to HeartDisease_FamilyHistory_Yes
dtypes: float64(2), int32(2), int64(2), uint8(32)
memory usage: 17.0 MB
In [92]:
X_test.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95407 entries, 304097 to 317388
Columns: 38 entries, BMI to HeartDisease_FamilyHistory_Yes
dtypes: float64(2), int32(2), int64(2), uint8(32)
memory usage: 7.3 MB
In [93]:
X_train.describe().T.head()
Out[93]:
count mean std min 25% 50% 75% max
BMI 222616.0 28.335417 6.377846 12.02 24.03 27.34 31.45 119.0
PhysicalHealth 222616.0 3.372188 7.952641 0.00 0.00 0.00 2.00 30.0
MentalHealth 222616.0 3.891683 7.944438 0.00 0.00 0.00 3.00 30.0
AgeCategory 222616.0 1.509339 0.943634 0.00 1.00 2.00 2.00 3.0
GenHealth 222616.0 2.594176 1.043680 0.00 2.00 3.00 3.00 4.0
In [94]:
X_test.describe().T.head()
Out[94]:
count mean std min 25% 50% 75% max
BMI 95407.0 28.314242 6.356699 12.02 24.0 27.32 31.45 92.53
PhysicalHealth 95407.0 3.380381 7.957179 0.00 0.0 0.00 2.00 30.00
MentalHealth 95407.0 3.922249 7.986935 0.00 0.0 0.00 3.00 30.00
AgeCategory 95407.0 1.513746 0.943889 0.00 1.0 2.00 2.00 3.00
GenHealth 95407.0 2.594317 1.041420 0.00 2.0 3.00 3.00 4.00
In [95]:
y_train.describe().T.head()
Out[95]:
count mean std min 25% 50% 75% max
HeartDisease 222616.0 0.085744 0.279986 0.0 0.0 0.0 0.0 1.0
In [96]:
y_test.describe().T.head()
Out[96]:
count mean std min 25% 50% 75% max
HeartDisease 95407.0 0.085633 0.279823 0.0 0.0 0.0 0.0 1.0
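The target means above (about 0.086 in both partitions) show that positives are rare, and the random holdout happened to preserve that ratio. The sketch below, on hypothetical synthetic data with a similar class ratio, shows one way to obtain a genuinely separate validation set and to guarantee the ratio in every partition with the stratify argument of train_test_split. The variable names and split sizes are illustrative assumptions, not the notebook's own choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 3 features, 8.6% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:86] = 1

# Step 1: set aside a validation set that is never used for fitting.
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Step 2: split the remaining development data into train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dev, y_dev, test_size=0.30, stratify=y_dev, random_state=42
)

print(len(y_tr), len(y_te), len(y_val))
print(y_tr.mean(), y_te.mean(), y_val.mean())
```

With stratification, the positive rate in each partition stays close to the overall rate regardless of the random seed.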

Model definition ¶

DecisionTreeClassifier¶

In [97]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
In [98]:
# Create the Decision Tree classifier with a fixed maximum depth
DT = DecisionTreeClassifier(max_depth=4, random_state=42)

# Train the model
DT.fit(X_train, y_train)

# Predict on the test set
y_pred1 = DT.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred1)
print("Decision Tree Classifier accuracy: {:.2f}%".format(accuracy * 100))

# Run 5-fold cross-validation on the training set
cv_scores = cross_val_score(DT, X_train, y_train, cv=5)
print("Cross-validation accuracy (mean): {:.2f}%".format(cv_scores.mean() * 100))

# Visualize the decision tree
fig, ax = plt.subplots(figsize=(40, 20))
tree.plot_tree(DT,
               ax=ax,
               fontsize=12,
               proportion=True,
               filled=True,
               feature_names=list(X_train.columns))
plt.show()
Decision Tree Classifier accuracy: 91.47%
Cross-validation accuracy (mean): 91.47%

KNeighbors Classifier¶

In [99]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
In [100]:
# Flatten the label frames into one-dimensional arrays
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

# Create the KNN model
KNN = KNeighborsClassifier(n_neighbors=3)  # Set the desired number of neighbors

# Train the model
KNN.fit(X_train, y_train)

# Predict on the test set
y_pred2 = KNN.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred2)
print("KNeighbors Classifier accuracy: {:.2f}%".format(accuracy * 100))

# Predict class-1 probabilities on the test set
y_pred_proba = KNN.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - KNeighbors Classifier')
plt.legend(loc="lower right")
plt.show()
KNeighbors Classifier accuracy: 90.04%

RandomForestClassifier¶

In [101]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc


# Create the Random Forest classifier
RF = RandomForestClassifier(n_estimators=100)  # Set the desired number of estimators

# Train the model
RF.fit(X_train, y_train)

# Predict on the test set
y_pred3 = RF.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred3)
print("Model accuracy: {:.2f}%".format(accuracy * 100))

# Show the confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred3)
print("Confusion matrix:")
print(confusion_mat)

# Predict class-1 probabilities on the test set
y_pred_proba = RF.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
Model accuracy: 90.41%
Confusion matrix:
[[85279  1958]
 [ 7195   975]]
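Because only about 8.6% of the test rows are positive, accuracy alone says little about how well heart disease cases are detected. The sketch below takes the confusion matrix printed above (rows are true classes, columns are predicted classes, following scikit-learn's convention) and derives accuracy, recall, precision, and the accuracy of a trivial model that always predicts class 0. Only NumPy is used; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the Random Forest output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[85279, 1958],
               [7195, 975]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of correct predictions
recall = tp / (tp + fn)            # share of true cases that were found
precision = tp / (tp + fp)         # share of predicted cases that were real
baseline = (tn + fp) / cm.sum()    # accuracy of always predicting class 0

print(f"accuracy : {accuracy:.4f}")
print(f"recall   : {recall:.4f}")
print(f"precision: {precision:.4f}")
print(f"baseline : {baseline:.4f}")
```

The computed accuracy matches the 90.41% reported above, yet it sits below the always-negative baseline, and recall for the positive class is only about 12%. This suggests comparing the models on recall, precision, or ROC AUC rather than on accuracy alone.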

Gradient Boosting Classifier¶

In [102]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
In [103]:
# Create the Gradient Boosting classifier
GB = GradientBoostingClassifier()

# Train the model
GB.fit(X_train, y_train)

# Predict on the test set
y_pred4 = GB.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred4)
print("Gradient Boosting Classifier accuracy: {:.2f}%".format(accuracy * 100))

# Predict class-1 probabilities on the test set
y_pred_proba = GB.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Gradient Boosting Classifier')
plt.legend(loc="lower right")
plt.show()
Gradient Boosting Classifier accuracy: 91.59%

Logistic Regression¶

In [104]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
In [105]:
# Preprocess the data: standardize the features.
# Note: this overwrites X_train and X_test with scaled NumPy arrays,
# so every later cell works on the scaled features.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create the logistic regression model
LR = LogisticRegression()

# Train the model on the training data
LR.fit(X_train, y_train)

# Predict on the test data
y_pred5 = LR.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred5)
print("Accuracy:", accuracy)

# Compute additional evaluation metrics
print("Classification report:")
print(classification_report(y_test, y_pred5))

# Predict class-1 probabilities on the test set
y_pred_proba = LR.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend(loc="lower right")
plt.show()
Accuracy: 0.915603676879055
Classification report:
              precision    recall  f1-score   support

           0       0.92      0.99      0.96     87237
           1       0.54      0.11      0.18      8170

    accuracy                           0.92     95407
   macro avg       0.73      0.55      0.57     95407
weighted avg       0.89      0.92      0.89     95407


FINAL MODEL COMPARISON ¶

 (Decision Tree vs KNeighborsClassifier vs Random Forest vs Gradient Boosting vs LogisticRegression)
In [106]:
# Collect the test-set accuracy of each fitted model in one table
final_data = pd.DataFrame({
    'MODELOS': ['DT', 'KNN', 'RF', 'GB', 'LR'],
    'ACC': [accuracy_score(y_test, y_pred1),
            accuracy_score(y_test, y_pred2),
            accuracy_score(y_test, y_pred3),
            accuracy_score(y_test, y_pred4),
            accuracy_score(y_test, y_pred5)],
})
final_data
Out[106]:
MODELOS ACC
0 DT 0.914713
1 KNN 0.900437
2 RF 0.904064
3 GB 0.915950
4 LR 0.915604
In [107]:
RANDOM_STATE = 42
n_estimators = 50
max_depth = 5

models = [ 
    ('DT', DecisionTreeClassifier(max_depth=max_depth, random_state=RANDOM_STATE)),
    ('KNN', KNeighborsClassifier()),
    ('RF', RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=RANDOM_STATE)),
    ('GB', GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=RANDOM_STATE)),
    ('LR', LogisticRegression(random_state=RANDOM_STATE))
]
In [108]:
plt.clf()
for model in models:
    model_name = model[0]
    model_instance = model[1]
    model_instance.fit(X_train, np.ravel(y_train))
    predictions = model_instance.predict_proba(X_test)[:,1]
    auc_score = metrics.roc_auc_score(y_test, predictions)
    print('ROC AUC Score for {}: {}'.format(model_name, auc_score))
    fpr, tpr, _ = metrics.roc_curve(y_test, predictions)
    plt.plot(fpr, tpr, label='ROC Curve for {} - Area: {:.2f}'.format(model_name, auc_score))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc="lower right")
plt.title('ROC curve')
plt.show()
ROC AUC Score for DT: 0.8186580594634724
ROC AUC Score for KNN: 0.6969406951439943
ROC AUC Score for RF: 0.8281936835808315
ROC AUC Score for GB: 0.8410159831483136
ROC AUC Score for LR: 0.8384305025425678
In [109]:
#%%time
#profile = ProfileReport(df, title="Pandas Profiling Report")

#profile.to_file("reports/EDA.html")